Predicting tumor recurrence on baseline MR imaging in patients with early-stage hepatocellular carcinoma using deep machine learning

Tumor recurrence affects up to 70% of early-stage hepatocellular carcinoma (HCC) patients, depending on treatment option. Deep learning algorithms allow in-depth exploration of imaging data to discover imaging features that may be predictive of recurrence. This study explored the use of convolutional neural networks (CNN) to predict HCC recurrence in patients with early-stage HCC from pre-treatment magnetic resonance (MR) images. This retrospective study included 120 patients with early-stage HCC. Pre-treatment MR images were fed into a machine learning pipeline (VGG16 and XGBoost) to predict recurrence within six different time frames (range 1–6 years). Model performance was evaluated with the area under the receiver operating characteristic curve (AUC–ROC). After prediction, the model’s clinical relevance was evaluated using Kaplan–Meier analysis with recurrence-free survival (RFS) as the endpoint. Of 120 patients, 44 had disease recurrence after therapy. The six models achieved AUC values between 0.71 and 0.85. In Kaplan–Meier analysis, five of six models obtained statistical significance when predicting RFS (log-rank p < 0.05). Our proof-of-concept study indicates that deep learning algorithms can be utilized to predict early-stage HCC recurrence. Successful identification of high-risk recurrence candidates may help optimize follow-up imaging and improve long-term outcomes post-treatment.

Recent studies have examined the value of machine learning algorithms for predicting HCC recurrence 10-13. So far, these studies have used a predefined catalog of hand-crafted features to predict HCC recurrence 10,12,14. This approach has drawbacks: hand-crafted features are created by humans, may be subject to selection bias, and may not capture the pathophysiological properties of cancer entirely. In contrast, deep learning algorithms can automatically extract the most relevant and predictive imaging features without human engineering. Deep learning algorithms can also be combined with medical imaging data to support a wide range of applications, including automated segmentation, lesion classification, and abnormality detection 15. Their hallmark lies in the advanced ability to deal with unstructured data, such as medical images 16, and to automatically extract relevant imaging features that may not be apparent to even the most experienced human eyes 17.
Although there are already studies on HCC recurrence prediction utilizing deep learning algorithms, most of them use CT imaging as input 18. In contrast, we ask whether MR imaging is more promising than CT because of its higher soft-tissue contrast. To the best of our knowledge, there is no study utilizing deep learning for HCC recurrence prediction in combination with MR imaging for early-stage HCC.
This study aims to build, validate, and test a deep learning approach to predict the post-treatment (surgical, ablative, or OLT) recurrence risk in early-stage HCC patients from pre-treatment MR imaging.

Methods
Patient population. This was a single-center retrospective study compliant with the Health Insurance Portability and Accountability Act (HIPAA). The study was approved by the institutional review board of Yale University, and informed consent was waived. The study was performed in accordance with the Declaration of Helsinki. From a total of 1190 patients treated for HCC at our institution from January 2005 to December 2018, 120 adults (≥ 18 years) were included according to the following criteria: (1) either histopathologically or imaging-criteria confirmed HCC, (2) either stage 0 or stage A, according to the most recent BCLC staging system 19, (3) surgical resection, thermal ablation, or OLT as first-line, stand-alone treatment after initial listing/consideration for transplant in all three scenarios, (4) available multi-parametric contrast-enhanced pre-treatment MR imaging, and (5) complete response after treatment, defined as the disappearance of all signs of viable tumor at first post-interventional follow-up imaging. Exclusion criteria were as follows: (1) imaging-confirmed macrovascular invasion or metastatic HCC, (2) presence of active co-malignancies from the time of HCC diagnosis through the entire follow-up period, and (3) excessive motion artifact on pre-treatment MR imaging. Pre-treatment clinical variables were collected. Patient data were extracted from the electronic health records. Further information regarding patient characteristics can be found in Table 1.
To ensure all patients were diagnosed according to current imaging diagnostic criteria, all MR images were read by three board-certified radiologists (two with 7 years and one with 8 years of experience in body MR imaging) according to the most recent Liver Imaging Reporting and Data System (LI-RADS v2018) 20 criteria. The dataset was split into three portions, and each radiologist read one portion independently. Patients with at least one LI-RADS (LR)-5 lesion were considered diagnosed with HCC. For patients not meeting LR-5 imaging criteria, biopsy or postoperative pathological confirmation of HCC was considered sufficient for diagnosis.
Clinical endpoint. Recurrence was defined as the intra- or extrahepatic appearance of HCC after treatment, confirmed by multi-parametric contrast-enhanced imaging or histology. Recurrence included intrahepatic local (within the liver, at the place of the original tumor), intrahepatic distant (within the liver, distant from the original tumor), and extrahepatic (outside the liver) recurrence. Recurrence-free survival (RFS) was defined as the time interval between curative treatment and first imaging evidence of recurrence. Multi-parametric contrast-enhanced MR imaging or computed tomography (CT) of the abdomen was used for follow-up imaging, including pre-contrast, arterial, portal venous, and delayed phase imaging. Patients received imaging every 3 months in the first post-procedure year and every 6 months after that. To monitor for extrahepatic recurrence, patients received a non-contrast chest CT every 6 months.
Imaging acquisition. Acquisitions from multiple MR imaging scanners with different contrast agents were used to allow for a more robust and generalizable machine learning model. Scanners were 1.5 T and 3 T models produced by Siemens (Aera, Avanto, Espree, TrioTim, Skyra, Verio), General Electric (SIGNA HDx), and Philips (Achieva). Studies were performed with several contrast agents, including Gadavist (Bayer), Magnevist (Bayer), Eovist (Bayer), Multihance (Bracco Diagnostics), Prohance (Bracco Diagnostics), Optimark (Covidien), and Dotarem (Guerbet). Contrast agents were administered at a dose of 1-1.5 mmol/kg with an injection speed of 2-5 mL/s. The input for our model consisted of multi-parametric, gadolinium-enhanced T1-weighted gradient echo acquisitions including arterial phase (35 s post-injection), portal venous phase (70 s post-injection), and delayed phase (3-5 min post-injection) imaging. Repetition and echo times ranged from 2.5-5.5 ms and 1-3 ms, respectively. Pixel bandwidth ranged from 250 to 650 Hz, pixel spacing from 0.5 to 1.7 mm, slice thickness from 3 to 5 mm, number of slices from 43 to 125, and image matrices from 160 × 160 to 415 × 200.
Machine learning pipeline. Image pre-processing. A rectangular, three-dimensional bounding box was placed around the liver in the pre-treatment, multi-parametric MR axial images using 3D Slicer (Version 4.10.2) 21. The task was performed by a third-year radiology resident, supervised by board-certified subspecialty-trained readers with 7 and 8 years of experience, as detailed above. The whole liver was chosen as the region of interest to capture most of the potentially predictive imaging features. The bounding box was created based on the arterial phase and then adapted to the portal venous and delayed phase images. An exemplary representation of a bounding box can be found in Supplementary Fig. 2. After that, corresponding slices of all three phases were resized to 224 × 224 pixels and combined into one stack. The resizing steps were performed to create a consistent input matrix size of 224 × 224 × 3 pixels required for feature extraction.
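The resizing and stacking step can be illustrated with a minimal sketch. It assumes NumPy and nearest-neighbour resampling; the interpolation method actually used is not stated in the text, and the function names `resize_nearest` and `stack_phases` as well as the random input slices are hypothetical.

```python
import numpy as np

def resize_nearest(img, size=224):
    """Resize a 2D slice to size x size using nearest-neighbour sampling
    (an assumption; the interpolation method is not given in the text)."""
    rows = (np.arange(size) * img.shape[0] / size).astype(int)
    cols = (np.arange(size) * img.shape[1] / size).astype(int)
    return img[np.ix_(rows, cols)]

def stack_phases(arterial, portal_venous, delayed, size=224):
    """Combine corresponding slices of the three phases into one
    (size, size, 3) array, one phase per channel."""
    phases = (arterial, portal_venous, delayed)
    return np.stack([resize_nearest(p, size) for p in phases], axis=-1)

# Synthetic cropped slices standing in for the liver bounding box.
rng = np.random.default_rng(0)
phases = [rng.random((160, 200)) for _ in range(3)]
stack = stack_phases(*phases)
print(stack.shape)
```

Placing one contrast phase in each channel is what lets a network designed for three-channel colour images accept the multi-phase input.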
Feature extraction. Feature extraction and voxel intensity normalization were performed using VGG16 22. This convolutional neural network (CNN) was pre-trained on a subset of the ImageNet database. The original pre-trained weights were used for feature extraction without fine-tuning the neural network. VGG16 was chosen due to its robust performance in various domains 23 and because it is an effective method when only a limited amount of imaging data is available. The voxel intensities were normalized using z-score normalization according to the formula $x_n' = (x_n - \mu)/\sigma$, where $x_n$ is the original voxel intensity value, $\mu$ the mean of all voxel intensity values per patient, $\sigma$ the standard deviation of all voxel intensity values per patient, and $x_n'$ the normalized voxel intensity value. In total, 4096 features per stack were extracted from the 2nd fully connected layer of the CNN. Finally, the maximum value per feature across all MR image stacks was pooled to obtain a single feature vector of 4096 features per patient. A more detailed overview of the pre-processing and feature extraction can be found in Supplementary Fig. 1.
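The per-patient z-score normalization and the max pooling across stacks can be sketched as follows. This is an illustration under assumptions, not the study's code: it uses NumPy, the population standard deviation (the text does not say which estimator was used), and random numbers in place of real voxel data and VGG16 outputs. The function names are hypothetical.

```python
import numpy as np

def zscore_per_patient(volume):
    """Apply x' = (x - mu) / sigma with mu and sigma computed over
    all voxel intensities of one patient."""
    mu = volume.mean()
    sigma = volume.std()  # population standard deviation (assumption)
    return (volume - mu) / sigma

def max_pool_features(stack_features):
    """Reduce an (n_stacks, n_features) array to one vector by taking
    the maximum of each feature across all stacks."""
    return stack_features.max(axis=0)

rng = np.random.default_rng(0)

# Synthetic patient volume: 8 stacks of 64 x 64 x 3 voxels.
volume = rng.normal(loc=300.0, scale=50.0, size=(8, 64, 64, 3))
normalized = zscore_per_patient(volume)

# Random stand-in for the 4096 features per stack from the CNN.
stack_features = rng.random((8, 4096))
patient_vector = max_pool_features(stack_features)
print(patient_vector.shape)
```

Max pooling yields a fixed-length vector regardless of how many slices a patient has, which matters here because the number of slices varied between acquisitions.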
Recurrence prediction. The final imaging feature vector was fed into XGBoost 24, a state-of-the-art tree-based ensemble classifier. To validate the algorithm, nested cross-validation (CV) 25 was performed, consisting of an outer and an inner loop. In the outer loop, leave-one-out CV was used to evaluate the model performance, meaning that all patients but one were used to train the model and the remaining one was used to validate it. In the inner loop, tenfold stratified Monte Carlo CV was used to optimize the model hyperparameters. For this, patients were randomly assigned to training (90%) and validation (10%) sets over six iterations. Figure 1 depicts an overview of our machine learning workflow.
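The structure of the nested CV can be sketched by generating only the split indices. The following uses the Python standard library; the helper names, the seeding scheme, and the number of inner repetitions (`n_iter=3`) are illustrative assumptions. The labels are synthetic and only mirror the cohort-level counts (120 patients, 44 recurrences). Model fitting and hyperparameter search are omitted.

```python
import random

def leave_one_out(n):
    """Outer loop: yield (train_indices, test_index) for each patient."""
    for i in range(n):
        yield [j for j in range(n) if j != i], i

def stratified_monte_carlo(labels, indices, n_iter, val_frac=0.1, seed=0):
    """Inner loop: repeated random 90/10 splits that sample the
    validation set separately from each class (stratification)."""
    rng = random.Random(seed)
    by_class = {}
    for idx in indices:
        by_class.setdefault(labels[idx], []).append(idx)
    for _ in range(n_iter):
        val = []
        for members in by_class.values():
            k = max(1, round(len(members) * val_frac))
            val.extend(rng.sample(members, k))
        val_set = set(val)
        yield [i for i in indices if i not in val_set], sorted(val)

labels = [1] * 44 + [0] * 76  # synthetic labels, counts as in the cohort
n_outer = 0
leak_found = False
for train_idx, test_idx in leave_one_out(len(labels)):
    for inner_train, inner_val in stratified_monte_carlo(
            labels, train_idx, n_iter=3, seed=test_idx):
        # Hyperparameters would be tuned on inner_train / inner_val here.
        if test_idx in inner_train or test_idx in inner_val:
            leak_found = True
    # The tuned model would be refit on train_idx and scored on test_idx.
    n_outer += 1
print(n_outer, leak_found)
```

The point of the nesting is that the held-out patient of the outer loop never takes part in hyperparameter selection, so the outer estimate is not biased by tuning.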
Outcome endpoint. The Kaplan-Meier method was used to investigate our machine learning pipeline's clinical relevance and efficiency, with RFS as the outcome of interest. Patients were stratified by the predictions of the models into "high-risk" and "low-risk" recurrence groups. As the machine learning model only provides a recurrence probability, a specific threshold had to be chosen against which the patients are stratified. For each time frame, the threshold was selected by dividing the number of recurrent cases by the number of all cases. Patients who recurred during the analyzed time frame were marked as events, whereas patients who remained cancer-free were censored. Survival curves were compared using the log-rank test. A p-value below 0.05 was considered significant.
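As a worked illustration of the threshold rule and the Kaplan-Meier estimator, consider the sketch below (standard library only). The 12 recurrences among 120 patients correspond to the 1-year time frame reported in this paper; the follow-up times in the toy survival example are made up, and treating a probability equal to the threshold as "high-risk" is an assumption.

```python
def prevalence_threshold(events):
    """Threshold = number of recurrent cases / number of all cases."""
    return sum(events) / len(events)

def stratify(probabilities, threshold):
    """Assign each predicted probability to a risk group
    (ties at the threshold count as high-risk; an assumption)."""
    return ["high-risk" if p >= threshold else "low-risk"
            for p in probabilities]

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time, where events
    holds 1 for recurrence and 0 for censoring."""
    curve, survival = [], 1.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        recurred = sum(1 for u, e in zip(times, events) if u == t and e)
        survival *= 1 - recurred / at_risk
        curve.append((t, survival))
    return curve

# 1-year time frame: 12 recurrences among 120 patients.
threshold = prevalence_threshold([1] * 12 + [0] * 108)
groups = stratify([0.05, 0.10, 0.40], threshold)

# Toy follow-up data (months); values are illustrative only.
curve = kaplan_meier(times=[3, 6, 6, 9, 12], events=[1, 0, 1, 0, 0])
print(threshold, groups, curve)
```

With this rule the threshold equals the recurrence prevalence of the time frame, so it shifts upward for the longer time frames in which a larger share of the remaining patients has recurred.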
Statistical analysis. The median and interquartile ranges were used to summarize patient characteristics. Recurrent and non-recurrent patients were compared using the Chi-square test (for categorical and ordinal variables) and the Mann-Whitney U test (for continuous variables). Bonferroni correction was used to adjust the significance level for the family-wise error rate in the baseline characteristics. The area under the receiver operating characteristic curve (AUC-ROC) was used to evaluate the machine learning model's performance. All statistical analysis was performed in Python (Version 2.7.3).
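Two of the quantities named above can be made concrete with a small sketch (standard library only). The AUC is computed through its pairwise-ranking interpretation, which is one of several equivalent formulations and not necessarily the routine used in the study. The labels, scores, and the number of tests in the Bonferroni example are hypothetical.

```python
def auc_roc(labels, scores):
    """AUC as the probability that a randomly chosen recurrent case
    receives a higher score than a randomly chosen non-recurrent case
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level that keeps the family-wise
    error rate at alpha across n_tests comparisons."""
    return alpha / n_tests

# Toy predictions: three of the four recurrent/non-recurrent pairs
# are ranked correctly.
auc = auc_roc(labels=[0, 0, 1, 1], scores=[0.1, 0.4, 0.35, 0.8])

# Hypothetical family of 20 baseline comparisons.
adjusted = bonferroni_alpha(0.05, 20)
print(auc, adjusted)
```

Because the AUC depends only on how cases are ranked, it is unaffected by the choice of classification threshold, which is why it can be reported separately from the risk-group stratification.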
Results
All MR images were re-examined according to the most recent LI-RADS criteria. In total, 93 patients (77.5%) were confirmed as being HCC-diagnosed with at least one LR-5 lesion on the pre-treatment MR imaging. The remaining 27 (22.5%) patients presented at most with one LR-4 (n = 17, 14.2%), LR-3 (n = 7, 5.8%), or LR-M (n = 3, 2.5%) lesion. Among these subjects, HCC was confirmed in 17 (14.2%) patients with liver biopsy and 10 (8.3%) patients with postoperative pathology. Table 2 summarizes the results of the LI-RADS reading.
Table 3
Recurrence-free survival. Figure 2 shows the Kaplan-Meier plots for RFS. To investigate the algorithm's clinical relevance, the final recurrence predictions were analyzed with the Kaplan-Meier method. According to the model's predictions, patients were stratified into "high-risk" and "low-risk" recurrence groups. The number of patients in each analysis corresponds to the number of patients in the corresponding recurrence prediction models (number of recurrences in parentheses), namely 120 (12), 116 (26), 98 (36), 82 (40), 74 (43), and 66 (44) for the 1-6 years models, respectively.

Prediction of recurrence.
The analysis revealed a significant difference between predicted low- and high-risk patients in five of the six recurrence time frames, namely the 2-, 3-, 4-, 5-, and 6-year time frames (log-rank test, p < 0.05). No statistically significant difference was obtained for the 1-year analysis (p = 0.06). A Kaplan-Meier analysis for each time frame and treatment can be found in Supplementary Fig. 3.
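For readers unfamiliar with the log-rank test, a minimal two-group sketch is given below (standard library only). It implements the textbook statistic with the hypergeometric variance and a chi-square distribution with one degree of freedom. The follow-up data are invented for illustration and do not reproduce the study's results, and the function name is hypothetical.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. Inputs are lists; events hold 1 for
    recurrence and 0 for censoring. Returns (chi-square, p-value)."""
    all_times = times_a + times_b
    all_events = events_a + events_b
    event_times = sorted({t for t, e in zip(all_times, all_events) if e})
    obs_minus_exp, variance = 0.0, 0.0
    for t in event_times:
        n_a = sum(1 for u in times_a if u >= t)  # at risk in group A
        n_b = sum(1 for u in times_b if u >= t)  # at risk in group B
        d_a = sum(1 for u, e in zip(times_a, events_a) if u == t and e)
        d_b = sum(1 for u, e in zip(times_b, events_b) if u == t and e)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = obs_minus_exp ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Invented data: early recurrences in one group, none in the other.
high_risk = ([1, 2, 3, 4, 5], [1, 1, 1, 1, 1])
low_risk = ([10, 11, 12, 13, 14], [0, 0, 0, 0, 0])
chi2, p_value = logrank_test(*high_risk, *low_risk)
print(chi2, p_value)
```

The statistic compares, at every recurrence time, the observed number of events in one group with the number expected if both groups shared the same hazard.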

Discussion
This study aimed to build, validate, and test a deep learning approach to predict the post-treatment early-stage HCC recurrence risk from pre-treatment MR imaging. For all pre-defined time frames (ranging from 1 to 6 years), our model was able to predict tumor recurrence using baseline imaging data with moderate to high accuracy and test AUCs between 0.71 and 0.85. The subsequent Kaplan-Meier analysis showed that such predictions allow patients to be stratified into risk groups with significant differences in expected RFS at five recurrence time frames, with greater levels of certainty for the later time points. Overall, this study demonstrates that machine learning algorithms, specifically CNNs in combination with decision trees, can detect MR imaging features of HCC and background liver parenchyma associated with tumor recurrence in treated early-stage disease.
The relatively high prevalence of early disease recurrence following curative or potentially curative therapies remains a major obstacle in HCC care, with 5-year recurrence rates being as high as 80% 3,26-28. Among other unmet needs, current predominantly non-functional and lesion-size-based imaging criteria have proven to be of limited value when differentiating between patients at high and low risk of recurrence. There is scientific consensus that novel biomarkers for improved prediction of outcome are an unmet need, with the hope of developing improved liver transplantation criteria that would incorporate factors beyond basic tumor morphology on cross-sectional imaging (number and size) 29. Along those lines, our study was designed to apply currently available deep learning techniques to this unmet need. Our data demonstrated that this neural network generated reliable outcome prediction based on disease morphology on baseline imaging with a performance similar to or better than previously reported in related works 12,13,30-33. While previous studies mainly used hand-crafted features based on human expert knowledge, our approach relies on automated feature extraction using a CNN. Overall, the clinical applicability of deep learning-based prediction of recurrences using baseline MR imaging will remain the subject of further study. A direct comparison with prior works on HCC recurrence prediction is limited by the fact that most of the published studies use CT 34 instead of MR imaging, the latter being far less available for staging or screening purposes, particularly in Asia. Lv et al. 35 proposed a deep-learning-based radiomics model to predict the 3-year recurrence of resected HCC from CT imaging, with a testing AUC of 0.83, notably higher than ours (0.71). Wang et al. 18 presented a comparable study design to predict early HCC recurrence with an AUC of 0.825. Nevertheless, it remains difficult to deduce which modality is preferable given the use of different machine learning approaches, inclusion criteria, and number of patients.
A major technical advantage of the presented work is the use of the whole liver volume for algorithmic input rather than relying on segmented regions of the tumor or single slices. In comparison, Kim et al. 10 attempted to predict early recurrences of HCC using quantitative imaging features, with the best results being obtained using a model that included the peritumoral environment (3 mm) rather than the tumor alone, underlining the need to look beyond the malignancy itself. Additionally, Ji et al. 36 demonstrated that in patients with cirrhosis, texture-specific imaging features of the liver background could be used to predict HCC recurrence, confirming that the underlying hepatic microenvironment as a whole may be reflective of the risk for recurrence. As HCC can be considered a secondary disease arising on top of a cirrhotic background, recurrent cancer may develop not only at the treated site but also de novo. Therefore, whole liver input may be useful to account for underlying parenchymal conditions as an additional source of disease recurrence.
In this study, we chose an investigated recurrence period of up to 6 years, as most post-operative HCC recurrences occur within the first 5-6 years. Our Kaplan-Meier analysis showed a statistically significant difference between low- and high-risk patients in five different time-to-recurrence (TTR) time frames with AUCs between 0.75 and 0.85. However, our model did not reach statistical significance for the 1-year time frame, which is likely due to the small number of recurrent cases (n = 12) and the large data imbalance between recurrent and non-recurrent cases (12 vs. 108).
Figure 2. Kaplan-Meier analysis of RFS for each of the analyzed time frames according to algorithm predictions. The red curves represent patients the algorithm classifies as "recurrent" (High-risk), while the green curves represent patients the algorithm classifies as "non-recurrent" (Low-risk). Each event on the green curve represents prediction inaccuracy (false-negative) while each event on the red curve represents an accurate prediction of recurrence (positive predictive value).
Our study has some limitations. First, this is a retrospective single-center study with a relatively small cohort. The lack of external validation can be a source of bias, and external algorithmic validation on larger datasets would be necessary before clinical deployment. In light of the relatively small cohort and to prevent overfitting the data, we did not set aside an independent test set at this point (e.g. 20% of the patients). However, the use of nested CV gives some confidence in the robustness of the learned models. As demonstrated in Table 1, we included nine HCC patients with end-stage liver cirrhosis (Child-Pugh class C). According to former BCLC versions, these patients would have been classified as terminal-stage (BCLC D) disease patients and would therefore not have met the inclusion criteria of our study. However, the most recent update of the BCLC staging system 19 allows such patients to be classified as early-stage and therefore to be considered for transplant if their tumor burden meets transplant criteria. In our study, all nine patients were treated with OLT and therefore met the BCLC 2022 guidelines. In observance of the new BCLC 2022 algorithm, one patient in our cohort is formally staged as BCLC B but was downstaged to early-stage and ultimately met transplant criteria. Last, the LI-RADS reading was done as a split, with each of the radiologists reviewing a subset of the cases. A design where each radiologist reviewed all cases would have been preferable.
As we further our research and its application, we intend to include various clinical features and modify the deep neural network to investigate whether prediction accuracy improves. Some clinical parameters are already known to be associated with the recurrence of HCC, including gender, age, AFP 37, NLR, or PLR 38. Ensemble methods that combine several predictors may yield higher accuracy and add to the algorithm's utility.
In conclusion, our work serves as proof of principle that deep learning-based algorithms can predict recurrence from pre-treatment MR imaging in patients with early-stage HCC. This study suggests that deep learning-based extraction of imaging features from baseline diagnostic imaging could be used to individualize treatment options with prognostications and adjust post-treatment follow-up in the sense of precision medicine. Future work should focus on enhancing the reliability of these algorithms with multi-center prospective cohort studies and the incorporation of clinical data.

Data availability
Data can be made available upon reasonable request to the corresponding author.